fix(ocap-kernel): enforce one delivery per crank, fix rollback cache staleness#879

Merged
rekmarks merged 17 commits into main from rekm/orphaned-exos
Mar 20, 2026

@rekmarks
Member

@rekmarks rekmarks commented Mar 17, 2026

As it turns out, we have been violating the invariant that a crank consists of the delivery of a single message or notification. Since at least the introduction of KernelQueue.ts in #484, one iteration of the kernel's run queue—which should be equivalent to a crank—has actually been able to deliver an unbounded number of messages.

This means that, if a delivery aborts mid-crank, rollbackCrank('start') reverts all deliveries in the crank (including earlier successful ones), creating inconsistency with vat in-memory state and leaving promise subscriptions permanently dangling.

This PR ensures that we correctly implement cranks via the kernel's run queue loop as described below.

Summary

  • Enforce one run-queue item per crank (change while to if in KernelQueue generator) and fix stale StoredQueue caches after rollbackCrank by refreshing the run queue and invalidating runQueueLengthCache
  • Reject JS promise subscriptions when a crank aborts with vat termination; fix terminateVat callback in Kernel to avoid deadlock by bypassing VatManager.terminateVat() (which calls waitForCrank())
  • Simplify the run queue implementation; in lieu of an async generator + loop, use a single loop with helper functions
  • Improve error messages for splat cases (revoked, no owner, no object, endpoint gone) and handle vanished endpoints in KernelRouter delivery
  • Fix SubclusterManager to catch rejected bootstrap promises
  • Add orphaned ephemeral exo tests (unit + e2e)
  • Glossary formatting and crank definition correction
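
To make the first bullet concrete, here is a minimal sketch of why draining the queue inside one crank is a problem. Everything in it (`ToyStore`, `runCrank`, `demo`, the `fails` flag) is hypothetical and stands in for the real kernel store and queue; it only assumes that a rollback restores the state captured at the start of the crank and that the failing item is dropped rather than retried.

```typescript
// Hypothetical in-memory stand-in for the kernel store; not the real API.
type Item = { id: string; fails?: boolean };
type Snapshot = { queue: Item[]; delivered: string[] };

class ToyStore {
  queue: Item[] = [];
  delivered: string[] = [];
  private snapshot: Snapshot | undefined;

  // Record the durable state at the start of the crank (a savepoint).
  startCrank(): void {
    this.snapshot = { queue: [...this.queue], delivered: [...this.delivered] };
  }

  // Restore everything to the state recorded by startCrank().
  rollbackCrank(): void {
    if (this.snapshot === undefined) {
      throw new Error('no crank in progress');
    }
    this.queue = [...this.snapshot.queue];
    this.delivered = [...this.snapshot.delivered];
  }

  endCrank(): void {
    this.snapshot = undefined;
  }
}

// Runs a single crank. With onePerCrank=false the crank keeps delivering
// until the queue is empty (the old `while` shape); with onePerCrank=true
// it delivers at most one item (the `if` shape).
function runCrank(store: ToyStore, onePerCrank: boolean): void {
  store.startCrank();
  try {
    for (;;) {
      const item = store.queue.shift();
      if (item === undefined) {
        return;
      }
      if (item.fails) {
        // Abort: undo the whole crank, then drop the failing item.
        store.rollbackCrank();
        store.queue = store.queue.filter((queued) => queued !== item);
        return;
      }
      store.delivered.push(item.id);
      if (onePerCrank) {
        return;
      }
    }
  } finally {
    store.endCrank();
  }
}

function demo(onePerCrank: boolean, cranks: number): string[] {
  const store = new ToyStore();
  store.queue = [{ id: 'a' }, { id: 'b', fails: true }, { id: 'c' }];
  for (let i = 0; i < cranks; i += 1) {
    runCrank(store, onePerCrank);
  }
  return store.delivered;
}

console.log(demo(false, 1)); // the record of 'a' is rolled back along with 'b'
console.log(demo(true, 3)); // 'a' and 'c' survive; only 'b' is undone
```

In the draining variant the record of `a` disappears from durable state even though the recipient already processed it, which is the inconsistency the description refers to.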

Test plan

  • Existing unit tests updated and passing (KernelQueue.test.ts, KernelRouter.test.ts, crank.test.ts, syscall-validation.test.ts, vat-lifecycle.test.ts)
  • New unit test for orphaned ephemeral exos (orphaned-ephemeral-exo.test.ts)
  • New e2e test for orphaned ephemeral exos (orphaned-ephemeral-exo.test.ts in kernel-node-runtime)

🤖 Generated with Claude Code


Note

High Risk
High risk because it changes core KernelQueue/KernelRouter crank semantics, rollback behavior, and how message failures propagate (resolve vs reject), which can affect delivery ordering, retries, and many callers/tests.

Overview
Kernel crank semantics are tightened and error propagation is made consistent. KernelQueue.run is rewritten to process exactly one run-queue item per crank, and JS-side subscriptions created by enqueueMessage now support both resolve and reject so rejected kernel promises reject the returned promise.
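
As a rough illustration of the resolve/reject subscription shape described above, the sketch below keeps a `{ resolve, reject }` pair per promise and picks one when the promise settles. The names `ToySubscriptions`, `register`, `subscribe`, `settle`, and `promiseId` are assumptions made for this example and are not taken from `KernelQueue`.

```typescript
// Hypothetical subscription registry; names are illustrative only.
type Subscription = {
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
};

class ToySubscriptions {
  private subs = new Map<string, Subscription>();

  // Store a resolve/reject pair for a promise identifier.
  register(promiseId: string, sub: Subscription): void {
    this.subs.set(promiseId, sub);
  }

  // Convenience wrapper: returns a JS promise backed by a subscription.
  subscribe(promiseId: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.register(promiseId, { resolve, reject });
    });
  }

  // Settle a subscription once; a rejected state rejects instead of resolving.
  // Returns false when nothing is subscribed to the given identifier.
  settle(
    promiseId: string,
    state: 'fulfilled' | 'rejected',
    value: unknown,
  ): boolean {
    const sub = this.subs.get(promiseId);
    if (sub === undefined) {
      return false;
    }
    this.subs.delete(promiseId);
    if (state === 'rejected') {
      sub.reject(value);
    } else {
      sub.resolve(value);
    }
    return true;
  }
}

const subs = new ToySubscriptions();
const log: string[] = [];
subs.register('p1', {
  resolve: (value) => {
    log.push(`resolved:${String(value)}`);
  },
  reject: (reason) => {
    log.push(`rejected:${String(reason)}`);
  },
});
subs.settle('p1', 'rejected', 'vat terminated');
console.log(log);
```

With a resolve-only subscription, the caller in the rejected case would either see a fulfilled promise carrying an error payload or never hear back at all; carrying both callbacks lets an aborted send fail the caller's promise directly.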

Rollback and termination handling are hardened. rollbackCrank now refreshes the stored run-queue and invalidates length caches to avoid stale in-memory state after DB rollback, and abort+terminate paths immediately reject the aborted send’s subscription. Kernel vat termination during a crank bypasses terminateVat() to avoid deadlock.
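
The stale-cache problem can be reproduced in a few lines. The following is a toy model, not the real `StoredQueue`: `ToyDb` and `CachedQueue` are hypothetical, and the only assumption is that the queue caches its head and tail pointers in memory while the database can be rolled back underneath it.

```typescript
// Hypothetical key-value store with a single savepoint.
class ToyDb {
  private data = new Map<string, number>();
  private saved = new Map<string, number>();

  get(key: string): number | undefined {
    return this.data.get(key);
  }

  set(key: string, value: number): void {
    this.data.set(key, value);
  }

  savepoint(): void {
    this.saved = new Map(this.data);
  }

  rollback(): void {
    this.data = new Map(this.saved);
  }
}

// Hypothetical queue that caches head/tail pointers outside the database.
class CachedQueue {
  private db: ToyDb;
  private head = 0;
  private tail = 0;

  constructor(db: ToyDb) {
    this.db = db;
    this.refresh();
  }

  // Re-read the cached pointers from the database.
  refresh(): void {
    this.head = this.db.get('head') ?? 0;
    this.tail = this.db.get('tail') ?? 0;
  }

  enqueue(value: number): void {
    this.db.set(`item.${this.tail}`, value);
    this.tail += 1;
    this.db.set('tail', this.tail);
  }

  dequeue(): number | undefined {
    if (this.head === this.tail) {
      return undefined;
    }
    const value = this.db.get(`item.${this.head}`);
    this.head += 1;
    this.db.set('head', this.head);
    return value;
  }
}

const db = new ToyDb();
const queue = new CachedQueue(db);
queue.enqueue(42);

db.savepoint();
const first = queue.dequeue(); // advances the cached head and the stored head
db.rollback(); // stored head goes back, cached head does not

const stale = queue.dequeue(); // cache claims the queue is empty
queue.refresh();
const fresh = queue.dequeue(); // the item is visible again

console.log({ first, stale, fresh });
```

After the rollback the database still holds the item, but the cached head pointer has already moved past it, so `dequeue` reports an empty queue until the pointers are re-read.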

Message “splat” cases are clearer and better handled. KernelRouter improves errors for revoked/no-owner/no-object/endpoint-gone cases, resolves splat rejections using the current promise decider, and treats vanished endpoints as a splat with promise rejection.

Tests/docs updated and expanded. Many tests are updated to expect promise rejections (including remote comms, revocation, lifecycle), new unit+e2e coverage is added for orphaned ephemeral exos across vat restart, kernel-utils exports a new isCapData guard used to rethrow bootstrap errors as real Errors, and the glossary is expanded/clarified (kernel promises/decider/crank definition).

Written by Cursor Bugbot for commit 233587c. This will update automatically on new commits. Configure here.

rekmarks and others added 4 commits March 17, 2026 12:41
…staleness

- Restructure run queue generator to yield exactly one item per
  startCrank/endCrank pair, preventing rollback from undoing
  unrelated earlier deliveries in the same crank
- Refresh StoredQueue after rollback so cached head/tail pointers
  are re-read from DB, fixing dequeue returning undefined
- Invalidate runQueueLengthCache after rollback
- Bypass VatManager.terminateVat() in KernelQueue callback to avoid
  waitForCrank() deadlock when terminating from within a crank
- Handle vanished endpoints in KernelRouter.deliverSend with
  try/catch, treating as splat instead of crashing
- Change KernelQueue subscriptions to {resolve, reject} so aborted
  sends can reject the caller's JS promise immediately
- Distinguish rejected vs fulfilled in invokeKernelSubscription
- Improve splat error messages to describe cause without leaking
  internal identifiers (krefs, endpoint IDs)
- Add integration test for orphaned ephemeral exo rejection
- Standardize KernelQueue test loop-exit pattern using sentinel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rekmarks rekmarks requested a review from FUDCo March 17, 2026 23:59
@rekmarks rekmarks marked this pull request as draft March 18, 2026 00:06
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 77.36% (⬇️ -0.05%) | 7870 / 10173 |
| 🔵 | Statements | 77.17% (⬇️ -0.05%) | 7995 / 10360 |
| 🔵 | Functions | 75.24% (⬇️ -0.11%) | 1891 / 2513 |
| 🔵 | Branches | 75.02% (⬆️ +0.07%) | 3232 / 4308 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/kernel-test/src/vats/orphaned-ephemeral-consumer.ts | 0% | 100% | 0% | 0% | 14-20 |
| packages/kernel-test/src/vats/orphaned-ephemeral-provider.ts | 0% | 100% | 0% | 0% | 11-19 |
| packages/kernel-ui/src/components/SendMessageForm.tsx | 100% (🟰 ±0%) | 72.72% (⬇️ -2.28%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-utils/src/index.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-utils/src/types.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/Kernel.ts | 88.18% (⬆️ +1.03%) | 77.77% (🟰 ±0%) | 82.6% (⬆️ +2.17%) | 88.18% (⬆️ +1.03%) | 286-289, 306, 330, 398-408, 500, 568, 634-637, 650, 660-661, 704, 721 |
| packages/ocap-kernel/src/KernelQueue.ts | 98.23% (⬆️ +0.10%) | 90.62% (⬆️ +1.95%) | 100% (🟰 ±0%) | 98.23% (⬆️ +0.10%) | 90, 351 |
| packages/ocap-kernel/src/KernelRouter.ts | 84.44% (⬇️ -5.72%) | 73.13% (⬇️ -2.25%) | 100% (🟰 ±0%) | 84.44% (⬇️ -5.72%) | 110, 169, 183, 235-258, 264, 291-300, 307, 353, 368, 371 |
| packages/ocap-kernel/src/types.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/remotes/kernel/RemoteHandle.ts | 88.5% (🟰 ±0%) | 82.56% (🟰 ±0%) | 87.5% (🟰 ±0%) | 88.77% (🟰 ±0%) | 347, 366-409, 462, 505, 515-517, 558-571, 910, 983, 1029 |
| packages/ocap-kernel/src/store/index.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/store/types.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/store/methods/crank.ts | 100% (🟰 ±0%) | 93.75% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/vats/SubclusterManager.ts | 95.07% (⬇️ -1.30%) | 88.88% (⬇️ -2.92%) | 100% (🟰 ±0%) | 95% (⬇️ -1.32%) | 194-197, 251, 334, 339-341, 357, 361 |
| packages/ocap-kernel/src/vats/VatHandle.ts | 90% (⬆️ +4.29%) | 85.71% (⬆️ +3.57%) | 100% (🟰 ±0%) | 90% (⬆️ +4.29%) | 305, 356-361, 367-373 |

Generated in workflow #3971 for commit 233587c by the Vitest Coverage Report Action

Contributor

@FUDCo FUDCo left a comment

I think I see what was going wrong here, and if my analysis is right, this PR does not address it at all. The issue is that the crank transaction is being driven inside the runQueueItems generator function, which is only responsible for pulling the next item off the run queue. The actual delivery happens in run, which iterates over the stream produced by runQueueItems, but the latter commits the transaction before the delivery has even happened. Somehow the refactoring that moved the run queue processing loop out of Kernel.ts and into KernelQueue.ts mangled this. I don't understand why run is in KernelQueue.ts at all.

// Queue empty — sleep until woken
const { promise, resolve } = makePromiseKit<void>();
if (this.#wakeUpTheRunQueue !== null) {
Fail`run queue already waiting to be woken; cannot sleep again before the previous wake handler is consumed`;
Contributor

Even though it would be both technically wrong and wildly inappropriate, I somehow feel the urge to change the error message to "Can't sleep. Clowns will eat me."

Simplifies the implementation of the kernel's run loop in a purely
behavioral refactor. The previous async generator + loop iteration has
been unwrapped into a single loop with multiple helper functions. I
noticed that the startCrank() call is the only part of the run loop that
can throw an uncaught exception, and made a note to investigate that
later.

An unrelated TODO comment is also added to the kernel router.
@rekmarks rekmarks marked this pull request as ready for review March 19, 2026 04:16
@rekmarks rekmarks requested a review from a team as a code owner March 19, 2026 04:16
@rekmarks
Member Author

rekmarks commented Mar 19, 2026

After further consideration, @FUDCo and I concluded that the run queue implementation was correct as of 46b674d, but the loop + async generator was difficult to reason about. b56cffc attempts to address this by moving to a single loop with helper functions.

@rekmarks
Member Author

@cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

*
* @yields the next item in the run queue.
*/
async *#runQueueItems(): AsyncGenerator<RunQueueItem> {

Crank lifecycle wraps empty queue check unnecessarily

Medium Severity

startCrank() and createCrankSavepoint('start') are called before checking whether the queue has an item. When the queue is empty, a phantom crank is started, a DB savepoint is created, and then endCrank() immediately releases it — all without any delivery. This violates the PR's stated invariant that a crank consists of exactly one delivery. The startCrank/createCrankSavepoint calls belong inside the else branch (after confirming a queue item exists), not before the emptiness check. As @FUDCo noted: "It's ending the crank before the delivery even happens."

Member Author

Unfortunately #getNextRunQueueItem() mutates the kernel store if there is an item on the run queue, so we have to start the crank and create the save point before calling it.

this.#kernelStore.startCrank();
let wakeUpPromise: Promise<void> | undefined;

this.#kernelStore.startCrank();
Member Author

startCrank() and endCrank() (down in the finally block) are the only parts of the run loop that can throw uncaught errors. This behavior was pre-existing. Is it what we want?

@rekmarks rekmarks requested a review from FUDCo March 19, 2026 05:40
Comment on lines +123 to +129
if (this.#kernelStore.runQueueLength() > 0) {
  const item = this.#kernelStore.dequeueRun();
  if (item) {
    return item;
  }
}
return undefined;
Member Author

ATTN: in the previous implementation, GC actions, reap actions, and run queue items were processed in this loop:

  1. All GC actions
  2. All reap actions
  3. All run queue items

Now, they are processed in this loop:

  1. All GC actions
  2. All reap actions
  3. One run queue item

Everything appears to work, but we should also convince ourselves that it's correct.
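
One way to reason about the ordering change is with a toy model. The sketch below is not the kernel's scheduler: `simulate` and its string work items are hypothetical, and it assumes each pass of the loop handles all pending GC actions, then all pending reap actions, then either every run queue item (old shape) or exactly one (new shape). It also assumes, purely for the demo, that delivering the first run queue item schedules one GC action.

```typescript
// Hypothetical model of one scheduling loop; not the real kernel code.
function simulate(drainRunQueue: boolean): string[] {
  const gc: string[] = [];
  const reap: string[] = [];
  const run: string[] = ['run:1', 'run:2'];
  const order: string[] = [];

  const deliver = (item: string): void => {
    order.push(item);
    // Demo assumption: delivering run:1 produces a GC action as a side effect.
    if (item === 'run:1') {
      gc.push('gc:1');
    }
  };

  while (gc.length + reap.length + run.length > 0) {
    // splice(0) empties each pending list and returns what it held.
    gc.splice(0).forEach(deliver);
    reap.splice(0).forEach(deliver);
    const batch = drainRunQueue ? run.splice(0) : run.splice(0, 1);
    batch.forEach(deliver);
  }
  return order;
}

console.log(simulate(true)); // old shape: all run queue items per pass
console.log(simulate(false)); // new shape: one run queue item per pass
```

Under these assumptions the new shape lets GC work produced by one delivery run before the next delivery rather than after the whole run queue has drained, which is the kind of observable difference worth checking against the real kernel.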

Add "kernel promise" entry distinguishing kernel promises from JS
promises, and "decider" entry with function call analogy. Update
existing entries to specify "kernel promise" where applicable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FUDCo previously approved these changes Mar 20, 2026
Contributor

@FUDCo FUDCo left a comment

Based on reviewing the recent commits and our conversation earlier today, I think this is OK to go. I do have one pedantic quibble about a glossary entry. Feel free to leave it as is and take care of it later, or if you want to fix it now I promise quick turnaround on a rubber stamp.

docs/glossary.md Outdated
runtime environment for vat code and handles object persistence, promise management, and
[syscall](#syscall) coordination.
runtime environment for vat code and handles object persistence, [kernel
promise](#kernel-promise) management, and [syscall](#syscall) coordination.
Contributor

Liveslots doesn't actually manage kernel promises at all, the kernel does (that's why they're called kernel promises). The only promises that get exposed outside the vat by liveslots do, in fact, turn into kernel promises, but liveslots doesn't know anything about this.

Contributor

@FUDCo FUDCo left a comment

G2G

@rekmarks rekmarks added this pull request to the merge queue Mar 20, 2026
Merged via the queue into main with commit c1464ed Mar 20, 2026
30 checks passed
@rekmarks rekmarks deleted the rekm/orphaned-exos branch March 20, 2026 02:13